Analyzing the community activity for version control systems

Context

Welcome to the JupyterLab exercise in which you carry out your first analysis of software data the data science way!

Exercise

Background

Technology choices differ. There may be objective reasons for choosing a technology at a specific time, but those reasons often change. A deep attachment to a now outdated technology can then block any progress. Objective reasons thus turn into subjective ones, which can create a toxic environment when technology updates are discussed.

Your task

You are a new team member in a software company. The developers there use a version control system ("VCS" for short) called CVS (Concurrent Versions System). Some want to migrate to a better VCS and prefer one called SVN (Subversion). You are young but not inexperienced, and you have heard about a newer version control system named "Git". So you propose Git as an alternative to the team. They are very sceptical about your suggestion. Find evidence that the software development community is mainly adopting the Git version control system!

The Dataset

A dataset from the online software developer community Stack Overflow is available in ../datasets/stackoverflow_vcs_data_subset.gz. It contains the following columns:

  • CreationDate: the timestamp of the creation date of a Stack Overflow post (= question)
  • TagName: the tag name for a technology (in our case for only 4 VCSes: "cvs", "svn", "git" and "mercurial")
  • ViewCount: the number of views of a post

These are the first 10 lines of this dataset (the header and nine entries):

CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
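
To get a feel for the data before working with the full file, the sample rows above can be parsed from an in-memory string. This is only a sketch: the names `sample_csv` and `sample` are made up for illustration and are not part of the exercise. It shows that CreationDate is initially read as text, and that the entry from 2008-08-03 22:38:29 appears twice, apparently because one post carries both the "svn" and the "git" tag.

```python
import io

import pandas as pd

# The sample rows shown above, held in memory instead of read from the .gz file.
# (sample_csv and sample are illustrative names, not part of the exercise.)
sample_csv = """CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
"""

sample = pd.read_csv(io.StringIO(sample_csv))

# Without an explicit conversion, CreationDate is plain text, not a timestamp.
print(sample.dtypes)

# Rows that share a timestamp: the same post listed once per tag.
shared_timestamp = sample[sample.duplicated(subset="CreationDate", keep=False)]
print(shared_timestamp)
```

Because a post with several tags is listed once per tag, its views count towards each of those VCSes.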

Code Snippets

Use your mouse to move the following code blocks below the appropriate steps. To do this, click and hold the area to the left of a code block, then drag and drop it:



In [ ]:
number_of_views = vcs_data.groupby(['CreationDate', 'TagName']).sum()
number_of_views.head()

In [ ]:
%matplotlib inline
monythly_views.plot(title="Monthly Stack Overflow post views");

In [ ]:
vcs_data['CreationDate'] = pd.to_datetime(vcs_data['CreationDate'])
vcs_data.head()

In [ ]:
import pandas as pd

vcs_data = pd.read_csv('../datasets/stackoverflow_vcs_data_subset.gz')
vcs_data.head()

In [ ]:
views_per_vcs = number_of_views.unstack()['ViewCount']
views_per_vcs.head()

In [ ]:
monythly_views = views_per_vcs.resample("1M").sum().cumsum()
monythly_views.head()

Your solution

Step 1: Load in the dataset

Step 2: Convert the CreationDate column to a real datetime datatype

Step 3: Sum up the number of views in ViewCount by the timestamp and the VCSes

Step 4: List the number of views for each VCS in separate columns

Step 5: Accumulate the number of views for the VCSes for every month

Step 6: Visualize the monthly views over time for all VCSes
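
If the reshaping in Steps 3 to 5 feels abstract, the following sketch runs the same kind of operations on a tiny, hypothetical table. The values and the names `toy`, `long_format`, `wide_format`, `monthly` and `running_total` are invented for illustration and do not come from the dataset. Note that it groups by calendar month via `to_period("M")` rather than using `resample` as the code snippets do; for this purpose, both give one row per month.

```python
import pandas as pd

# Hypothetical toy data (not taken from the dataset): three rows, two tags.
toy = pd.DataFrame({
    "CreationDate": pd.to_datetime(
        ["2020-01-10 08:00:00", "2020-01-10 08:00:00", "2020-02-05 09:30:00"]
    ),
    "TagName": ["git", "svn", "git"],
    "ViewCount": [100, 40, 25],
})

# Long format: one row per (timestamp, tag) pair with summed views.
long_format = toy.groupby(["CreationDate", "TagName"]).sum()

# Wide format: one column per tag; combinations without data become NaN.
wide_format = long_format.unstack()["ViewCount"]
print(wide_format)

# Sum the views per calendar month (NaN is skipped), then keep a running total.
monthly = wide_format.groupby(wide_format.index.to_period("M")).sum()
running_total = monthly.cumsum()
print(running_total)
```

The running total never decreases, so a plot of it shows how quickly interest in each VCS accumulates rather than how it fluctuates from month to month.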

Execution

You can execute a single code block by clicking in it and pressing Ctrl + Enter.

To execute all cells, select "Run" and "Run All Cells" in the menu bar in the upper left.

Discussion

What are your conclusions? Discuss!